Computation and Language 48
☆ Bridging Episodes and Semantics: A Novel Framework for Long-Form Video Understanding ECCV'24
While existing research often treats long-form videos as extended short
videos, we propose a novel approach that more accurately reflects human
cognition. This paper introduces BREASE: BRidging Episodes And SEmantics for
Long-Form Video Understanding, a model that simulates episodic memory
accumulation to capture action sequences and reinforces them with semantic
knowledge dispersed throughout the video. Our work makes two key contributions:
First, we develop an Episodic COmpressor (ECO) that efficiently aggregates
crucial representations from micro to semi-macro levels. Second, we propose a
Semantics reTRiever (SeTR) that enhances these aggregated representations with
semantic information by focusing on the broader context, dramatically reducing
feature dimensionality while preserving relevant macro-level information.
Extensive experiments demonstrate that BREASE achieves state-of-the-art
performance across multiple long video understanding benchmarks in both
zero-shot and fully-supervised settings. The project page and code are at:
https://joslefaure.github.io/assets/html/hermes.html.
comment: Accepted to the EVAL-FoMo Workshop at ECCV'24. Project page:
https://joslefaure.github.io/assets/html/hermes.html
☆ SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists
Traditional benchmarking in NLP typically involves using static held-out test
sets. However, this approach often results in an overestimation of performance
and lacks the ability to offer comprehensive, interpretable, and dynamic
assessments of NLP models. Recently, works like DynaBench (Kiela et al., 2021)
and CheckList (Ribeiro et al., 2020) have addressed these limitations through
behavioral testing of NLP models with test types generated by a multistep
human-annotated pipeline. Unfortunately, manually creating a variety of test
types requires much human labor, often at prohibitive cost. In this work, we
propose SYNTHEVAL, a hybrid behavioral testing framework that leverages large
language models (LLMs) to generate a wide range of test types for a
comprehensive evaluation of NLP models. SYNTHEVAL first generates sentences via
LLMs using controlled generation, and then identifies challenging examples by
comparing the predictions made by LLMs with task-specific NLP models. In the
last stage, human experts investigate the challenging examples, manually design
templates, and identify the types of failures the task-specific models
consistently exhibit. We apply SYNTHEVAL to two classification tasks, sentiment
analysis and toxic language detection, and show that our framework is effective
in identifying weaknesses of strong models on these tasks. We share our code in
https://github.com/Loreley99/SynthEval_CheckList.
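The second-stage filter above (keeping only examples on which the LLM and the task-specific model disagree) can be sketched in a few lines. This is our illustrative reading, and both predictors below are toy sentiment stubs, not the paper's models.

```python
# Hedged sketch of SYNTHEVAL's challenging-example selection: keep only
# generated sentences on which an LLM and a task-specific model disagree.
# Both predictors are toy stand-ins (1 = positive sentiment).

def find_challenging(examples, llm_predict, task_predict):
    """Return examples where the two predictors disagree."""
    return [ex for ex in examples if llm_predict(ex) != task_predict(ex)]

# Stand-in predictors: the "task model" handles negation differently.
llm = lambda s: 1 if "great" in s else 0
task = lambda s: 1 if ("great" in s and "not" not in s) else 0

sentences = ["a great film", "not a great film", "a dull film"]
print(find_challenging(sentences, llm, task))  # ['not a great film']
```

The disagreement set is then handed to human experts for template design, per the pipeline described in the abstract.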
☆ CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models
The digitisation of historical print media archives is crucial for increasing
accessibility to contemporary records. However, the process of Optical
Character Recognition (OCR) used to convert physical records to digital text is
prone to errors, particularly in the case of newspapers and periodicals due to
their complex layouts. This paper introduces Context Leveraging OCR Correction
(CLOCR-C), which utilises the infilling and context-adaptive abilities of
transformer-based language models (LMs) to improve OCR quality. The study aims
to determine whether LMs can perform post-OCR correction and improve downstream
NLP tasks, and to assess the value of providing socio-cultural context as part
of the correction process. Experiments were conducted using seven LMs on three
datasets: the 19th Century Serials Edition (NCSE) and two datasets from the
Overproof collection. The results demonstrate that some LMs can significantly
reduce error rates, with the top-performing model achieving over a 60%
reduction in character error rate on the NCSE dataset. The OCR improvements
extend to downstream tasks, such as Named Entity Recognition, with increased
Cosine Named Entity Similarity. Furthermore, the study shows that providing
socio-cultural context in the prompts improves performance, while misleading
prompts lower performance. In addition to the findings, this study releases a
dataset of 91 transcribed articles from the NCSE, containing a total of 40
thousand words, to support further research in this area. The findings suggest
that CLOCR-C is a promising approach for enhancing the quality of existing
digital archives by leveraging the socio-cultural information embedded in the
LMs and the text requiring correction.
comment: 13 pages, 3 figures, currently under peer review
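The headline metric above, character error rate (CER), is edit distance over reference length. A minimal dynamic-programming version is sketched below; this is our own illustration (real evaluations typically use a dedicated library such as jiwer), and the example strings are invented.

```python
# Minimal character error rate: Levenshtein distance / reference length.

def cer(reference: str, hypothesis: str) -> float:
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / m

reference = "The quick brown fox"
ocr_output = "Tbe quick hrown fox"   # two character substitutions
print(round(cer(reference, ocr_output), 3))  # 0.105
print(cer(reference, reference))             # 0.0
```

A "60% reduction in CER" then simply means the post-correction CER is 40% of the raw OCR value.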
☆ NDP: Next Distribution Prediction as a More Broad Target
Junhao Ruan, Abudukeyumu Abudula, Xinyu Liu, Bei Li, Yinqiao Li, Chenglong Wang, Yuchun Fan, Yuan Ge, Tong Xiao, Jingbo Zhu
Large language models (LLMs) trained on next-token prediction (NTP) paradigm
have demonstrated powerful capabilities. However, the existing NTP paradigm
contains several limitations, particularly related to planned task
complications and error propagation during inference. In our work, we extend
the critique of NTP, highlighting a further limitation: training with a narrow
objective, the prediction of a sub-optimal one-hot distribution. To support
this critique, we conducted a preliminary experiment treating the output
distribution of powerful LLMs as an efficient compression of world data. By
comparing $n$-gram and one-hot distributions against LLM outputs, we observed
that $n$-gram distributions align more closely with the output distribution of
LLMs. Based on this insight, we
introduce Next Distribution Prediction (NDP), which uses $n$-gram distributions
to replace the one-hot targets, enhancing learning without extra online
training time. We conducted experiments across translation, general task,
language transfer, and medical domain adaptation. Compared to NTP, NDP can
achieve up to a +2.97 COMET improvement in translation tasks, a +0.61 average
improvement in general tasks, and a striking +10.75 average improvement in the
medical domain. This demonstrates the concrete benefits of addressing the
target narrowing problem, pointing to a new direction for future work on
improving NTP.
comment: 8 pages, 5 figures
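The core substitution in NDP, replacing the one-hot target with an empirical n-gram distribution and training with cross-entropy against that soft target, can be sketched as below. The bigram counts and model probabilities are illustrative values of our own, not the paper's data.

```python
# Sketch of NDP's soft target: an empirical next-token distribution from
# n-gram counts replaces the one-hot label in the cross-entropy loss.
import math
from collections import Counter

def ngram_target(corpus_bigrams, context):
    """Empirical distribution over tokens observed after `context`."""
    counts = Counter(nxt for ctx, nxt in corpus_bigrams if ctx == context)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def soft_cross_entropy(target_dist, model_probs):
    """H(target, model) = -sum_t target(t) * log model(t)."""
    return -sum(p * math.log(model_probs[tok])
                for tok, p in target_dist.items())

bigrams = [("the", "cat"), ("the", "cat"), ("the", "dog"), ("a", "cat")]
target = ngram_target(bigrams, "the")   # {'cat': 2/3, 'dog': 1/3}
model = {"cat": 0.5, "dog": 0.4, "a": 0.1}
print(round(soft_cross_entropy(target, model), 3))  # 0.768
```

With a one-hot target the loss would ignore the probability mass the model places on "dog"; the soft target rewards matching the whole observed continuation distribution.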
☆ Assessing Generative Language Models in Classification Tasks: Performance and Self-Evaluation Capabilities in the Environmental and Climate Change Domain
This paper examines the performance of two Large Language Models (LLMs),
GPT-3.5 and Llama 2, and one Small Language Model (SLM), Gemma, across three
different classification tasks within the climate change (CC) and environmental
domain. Employing BERT-based models as a baseline, we compare their efficacy
against these transformer-based models. Additionally, we assess the models'
self-evaluation capabilities by analyzing the calibration of verbalized
confidence scores in these text classification tasks. Our findings reveal that
while BERT-based models generally outperform both the LLMs and SLM, the
performance of the large generative models is still noteworthy. Furthermore,
our calibration analysis reveals that although Gemma is well-calibrated in
initial tasks, it thereafter produces inconsistent results; Llama is reasonably
calibrated, and GPT consistently exhibits strong calibration. Through this
research, we aim to contribute to the ongoing discussion on the utility and
effectiveness of generative LMs in addressing some of the planet's most urgent
issues, highlighting their strengths and limitations in the context of ecology
and CC.
comment: 11 pages, to be published in NLDB 2024
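Calibration of verbalized confidence, as analyzed above, is commonly summarized by expected calibration error (ECE): bin predictions by confidence and compare each bin's accuracy with its average confidence. The sketch below is a generic ECE computation with fabricated numbers, not the paper's evaluation code.

```python
# Minimal expected calibration error over equal-width confidence bins.

def ece(confidences, correct, n_bins=5):
    """Weighted average of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        err += len(b) / total * abs(acc - avg_conf)
    return err

confs = [0.9, 0.8, 0.95, 0.6, 0.55]   # verbalized confidences
correct = [1, 1, 0, 1, 0]             # whether each prediction was right
print(round(ece(confs, correct), 3))  # 0.32
```

A well-calibrated model (the GPT behavior reported above) has ECE near zero; inconsistent verbalized confidence inflates it.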
☆ Impact of ChatGPT on the writing style of condensed matter physicists
We apply a state-of-the-art difference-in-differences approach to estimate
the impact of ChatGPT's release on the writing style of condensed matter papers
on arXiv. Our analysis reveals a statistically significant improvement in the
English quality of abstracts written by non-native English speakers.
Importantly, this improvement remains robust even after accounting for other
potential factors, confirming that it can be attributed to the release of
ChatGPT. This indicates widespread adoption of the tool. Following the release
of ChatGPT, there is a significant increase in the use of unique words, while
the frequency of rare words decreases. Across language families, the changes in
writing style are significant for authors from the Latin and Ural-Altaic
groups, but not for those from the Germanic or other Indo-European groups.
comment: 9 pages, 1 figure, 7 tables
☆ Modularity in Transformers: Investigating Neuron Separability & Specialization
Transformer models are increasingly prevalent in various applications, yet
our understanding of their internal workings remains limited. This paper
investigates the modularity and task specialization of neurons within
transformer architectures, focusing on both vision (ViT) and language (Mistral
7B) models. Using a combination of selective pruning and MoEfication clustering
techniques, we analyze the overlap and specialization of neurons across
different tasks and data subsets. Our findings reveal evidence of task-specific
neuron clusters, with varying degrees of overlap between related tasks. We
observe that neuron importance patterns persist to some extent even in randomly
initialized models, suggesting an inherent structure that training refines.
Additionally, we find that neuron clusters identified through MoEfication
correspond more strongly to task-specific neurons in earlier and later layers
of the models. This work contributes to a more nuanced understanding of
transformer internals and offers insights into potential avenues for improving
model interpretability and efficiency.
comment: 11 pages, 6 figures
☆ Investigating Neuron Ablation in Attention Heads: The Case for Peak Activation Centering
The use of transformer-based models is growing rapidly throughout society.
With this growth, it is important to understand how they work, and in
particular, how the attention mechanisms represent concepts. Though there are
many interpretability methods, many look at models through their neuronal
activations, which are poorly understood. We describe different lenses through
which to view neuron activations, and investigate their effectiveness in
language models and vision transformers using various methods of neuron
ablation: zero
ablation, mean ablation, activation resampling, and a novel approach we term
'peak ablation'. Through experimental analysis, we find that in different
regimes and models, each method can offer the lowest degradation of model
performance compared to other methods, with resampling usually causing the most
significant performance deterioration. We make our code available at
https://github.com/nickypro/investigating-ablation.
comment: 9 pages, 2 figures, XAI World Conference 2024 Late-Breaking Work
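The four ablation schemes compared above can be illustrated on a plain activation vector. Note that "peak ablation" below is our reading of peak-activation centering (replace activations with the densest value of their empirical distribution); treat the exact definition, and all numbers, as assumptions.

```python
# Toy versions of the four neuron-ablation schemes from the abstract.
import random
from statistics import mean

def zero_ablate(acts):
    return [0.0] * len(acts)

def mean_ablate(acts, dataset_acts):
    m = mean(dataset_acts)
    return [m] * len(acts)

def resample_ablate(acts, dataset_acts, rng):
    """Replace each activation with one drawn from the dataset."""
    return [rng.choice(dataset_acts) for _ in acts]

def peak_ablate(acts, dataset_acts, bins=10):
    """Replace activations with the histogram peak (modal value)."""
    lo, hi = min(dataset_acts), max(dataset_acts)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for a in dataset_acts:
        counts[min(int((a - lo) / width), bins - 1)] += 1
    k = counts.index(max(counts))
    return [lo + (k + 0.5) * width] * len(acts)   # bin centre

data = [0.1, 0.1, 0.12, 0.9, 0.11, 0.13]          # observed activations
print(zero_ablate([1.0, 2.0]))                    # [0.0, 0.0]
print(round(peak_ablate([1.0, 2.0], data)[0], 2)) # 0.14
print(resample_ablate([1.0, 2.0], data, random.Random(0)))
```

For the skewed distribution above, the peak (0.14) differs noticeably from the mean (about 0.24), which is the kind of gap the paper's comparison probes.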
☆ Bridging Domain Knowledge and Process Discovery Using Large Language Models
Discovering good process models is essential for different process analysis
tasks such as conformance checking and process improvements. Automated process
discovery methods often overlook valuable domain knowledge. This knowledge,
including insights from domain experts and detailed process documentation,
remains largely untapped during process discovery. This paper leverages Large
Language Models (LLMs) to integrate such knowledge directly into process
discovery. We use rules derived from LLMs to guide model construction, ensuring
alignment with both domain knowledge and actual process executions. By
integrating LLMs, we create a bridge between process knowledge expressed in
natural language and the discovery of robust process models, advancing process
discovery methodologies significantly. To showcase the usability of our
framework, we conducted a case study with the UWV employee insurance agency,
demonstrating its practical benefits and effectiveness.
comment: This paper is accepted at the AI4BPM 2024 workshop and to be
published in their proceedings
☆ Towards Tailored Recovery of Lexical Diversity in Literary Machine Translation
Machine translations are found to be lexically poorer than human
translations. The loss of lexical diversity through MT poses an issue in the
automatic translation of literature, where it matters not only what is written,
but also how it is written. Current methods for increasing lexical diversity in
MT are rigid. Yet, as we demonstrate, the degree of lexical diversity can vary
considerably across different novels. Thus, rather than aiming for the rigid
increase of lexical diversity, we reframe the task as recovering what is lost
in the machine translation process. We propose a novel approach that consists
of reranking translation candidates with a classifier that distinguishes
between original and translated text. We evaluate our approach on 31
English-to-Dutch book translations, and find that, for certain books, our
approach retrieves lexical diversity scores that are close to human
translation.
comment: Accepted to EAMT 2024
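The reranking step above reduces to scoring each MT candidate and keeping the best one. In the sketch below, a toy type-token-ratio heuristic stands in for the paper's trained original-vs-translated classifier; the candidate sentences are invented.

```python
# Simplified candidate reranking for lexical-diversity recovery.
# A type-token ratio stands in for the original-vs-translated classifier.

def type_token_ratio(text: str) -> float:
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

def rerank(candidates, score=type_token_ratio):
    """Return the candidate the scorer prefers."""
    return max(candidates, key=score)

candidates = [
    "the man walked and walked and walked home",   # repetitive
    "the man trudged wearily all the way home",    # more diverse
]
print(rerank(candidates))  # picks the more lexically diverse candidate
```

Swapping the heuristic for a classifier that outputs P(original | text) gives the method described in the abstract, recovering diversity only to the degree the classifier finds natural.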
☆ Flexible and Effective Mixing of Large Language Models into a Mixture of Domain Experts
We present a toolkit for creating low-cost Mixture-of-Domain-Experts (MOE)
from trained models. The toolkit can be used for creating a mixture from models
or from adapters. We perform extensive tests and offer guidance on defining the
architecture of the resulting MOE using the toolkit. A public repository is
available.
☆ Improving Extraction of Clinical Event Contextual Properties from Electronic Health Records: A Comparative Study
Electronic Health Records are large repositories of valuable clinical data,
with a significant portion stored in unstructured text format. This textual
data includes clinical events (e.g., disorders, symptoms, findings, medications
and procedures) in context that if extracted accurately at scale can unlock
valuable downstream applications such as disease prediction. Using an existing
Named Entity Recognition and Linking methodology, MedCAT, these identified
concepts need to be further classified (contextualised), for example by their
relevance to the patient and their temporal and negated status, to be useful
downstream. This study performs a comparative analysis of various natural
language models for medical text classification. Extensive experimentation
reveals the effectiveness of transformer-based language models, particularly
BERT. When combined with class imbalance mitigation techniques, BERT
outperforms Bi-LSTM models by up to 28% and the baseline BERT model by up to
16% for recall of the minority classes. The method has been implemented as part
of CogStack/MedCAT framework and made available to the community for further
research.
☆ Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model
Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei Xue
Recent advancements in audio generation have been significantly propelled by
the capabilities of Large Language Models (LLMs). The existing research on
audio LLM has primarily focused on enhancing the architecture and scale of
audio language models, as well as leveraging larger datasets, and generally,
acoustic codecs, such as EnCodec, are used for audio tokenization. However,
these codecs were originally designed for audio compression, which may lead to
suboptimal performance in the context of audio LLM. Our research aims to
address the shortcomings of current audio LLM codecs, particularly their
challenges in maintaining semantic integrity in generated audio. For instance,
existing methods like VALL-E, which condition acoustic token generation on text
transcriptions, often suffer from content inaccuracies and elevated word error
rates (WER) due to semantic misinterpretations of acoustic tokens, resulting in
word skipping and errors. To overcome these issues, we propose a
straightforward yet effective approach called X-Codec. X-Codec incorporates
semantic features from a pre-trained semantic encoder before the Residual
Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss
after RVQ. By enhancing the semantic ability of the codec, X-Codec
significantly reduces WER in speech synthesis tasks and extends these benefits
to non-speech applications, including music and sound generation. Our
experiments in text-to-speech, music continuation, and text-to-sound tasks
demonstrate that integrating semantic information substantially improves the
overall performance of language models in audio generation. Our code and demo
are available (Demo: https://x-codec-audio.github.io Code:
https://github.com/zhenye234/xcodec)
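The X-Codec objective, as we read it, adds a semantic reconstruction term to the usual codec reconstruction loss. The schematic below uses toy vectors and an assumed weighting `lam`; neither the loss shapes nor the values come from the paper.

```python
# Schematic X-Codec-style objective: acoustic reconstruction plus a
# semantic reconstruction term after RVQ. All values are illustrative.

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def xcodec_loss(audio, decoded_audio, sem_target, sem_decoded, lam=1.0):
    acoustic = mse(audio, decoded_audio)     # usual codec recon loss
    semantic = mse(sem_target, sem_decoded)  # semantic recon after RVQ
    return acoustic + lam * semantic

audio, decoded = [0.2, 0.4], [0.1, 0.5]
sem_t, sem_d = [1.0, 0.0], [0.8, 0.1]       # semantic-encoder features
print(round(xcodec_loss(audio, decoded, sem_t, sem_d), 3))  # 0.035
```

The key design point is that the gradient of the semantic term flows into the quantizer, pushing the discrete tokens to preserve meaning as well as acoustics.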
☆ MaFeRw: Query Rewriting with Multi-Aspect Feedbacks for Retrieval-Augmented Large Language Models
In a real-world RAG system, the current query often involves spoken ellipses
and ambiguous references from dialogue contexts, necessitating query rewriting
to better describe the user's information needs. However, traditional
context-based rewriting yields minimal improvement in downstream generation
tasks because of the lengthy pipeline from query rewriting to response
generation. Some researchers
try to utilize reinforcement learning with generation feedback to assist the
rewriter, but these sparse rewards provide little guidance in most cases,
leading to unstable training and generation results. We find that the user's
needs are also reflected in the gold document, the retrieved documents, and the
ground truth.
Therefore, by feeding back these multi-aspect dense rewards to query rewriting,
more stable and satisfactory responses can be achieved. In this paper, we
propose a novel query rewriting method MaFeRw, which improves RAG performance
by integrating multi-aspect feedback from both the retrieval process and
generated results. Specifically, we first use manual data to train a T5 model
for the rewriter initialization. Next, we design three metrics as reinforcement
learning feedback: the similarity between the rewritten query and the gold
document, the ranking metrics, and ROUGE between the generation and the ground
truth. Inspired by RLAIF, we train three kinds of reward models for the above
metrics to achieve more efficient training. Finally, we combine the scores of
these reward models as feedback, and use PPO algorithm to explore the optimal
query rewriting strategy. Experimental results on two conversational RAG
datasets demonstrate that MaFeRw achieves superior generation metrics and more
stable training compared to baselines.
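The dense reward fed to PPO combines the three metrics named above. The toy function below mirrors that combination as a weighted sum; the weights and scores are placeholders, since the paper trains a reward model per metric rather than fixing weights.

```python
# Placeholder multi-aspect reward: similarity to the gold document,
# a retrieval ranking score, and ROUGE against the ground truth.

def combined_reward(sim_gold_doc, ranking_score, rouge_score,
                    weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the three feedback signals used as the RL reward."""
    w1, w2, w3 = weights
    return w1 * sim_gold_doc + w2 * ranking_score + w3 * rouge_score

print(round(combined_reward(0.8, 0.5, 0.6), 2))  # 0.65
```

Because each component is dense (a score for every rollout, not a sparse end-task success signal), the PPO updates described above get a gradient signal on every rewrite.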
☆ Novel-WD: Exploring acquisition of Novel World Knowledge in LLMs Using Prefix-Tuning
Teaching new information to pre-trained large language models (PLM) is a
crucial but challenging task. Model adaptation techniques, such as fine-tuning
and parameter-efficient training, have been shown to store new facts at a slow
rate; continual learning is an option but is costly and prone to catastrophic
forgetting. This work studies and quantifies how PLM may learn and remember new
world knowledge facts that do not occur in their pre-training corpus, which
only contains world knowledge up to a certain date. To that purpose, we first
propose Novel-WD, a new dataset consisting of sentences containing novel facts
extracted from recent Wikidata updates, along with two evaluation tasks in the
form of causal language modeling and multiple choice questions (MCQ). We make
this dataset freely available to the community, and release a procedure to
later build new versions of similar datasets with up-to-date information. We
also explore the use of prefix-tuning for novel information learning, and
analyze how much information can be stored within a given prefix. We show that
a single fact can reliably be encoded within a single prefix, and that the
prefix capacity increases with its length and with the base model size.
☆ From Text to Emotion: Unveiling the Emotion Annotation Capabilities of LLMs
Training emotion recognition models has relied heavily on human-annotated
data, which presents diversity, quality, and cost challenges. In this paper, we
explore the potential of Large Language Models (LLMs), specifically GPT-4, in
automating or assisting emotion annotation. We compare GPT-4 with supervised
models and with humans in three aspects: agreement with human annotations,
alignment with human perception, and impact on model training. We find that
common metrics that use aggregated human annotations as ground truth can
underestimate the performance of GPT-4, and our human evaluation experiment
reveals a consistent preference for GPT-4 annotations over human ones across
multiple datasets and evaluators. Further, we investigate the impact of using
GPT-4 as an annotation filtering process to improve model training. Together,
our findings highlight the great potential of LLMs in emotion annotation tasks
and underscore the need for refined evaluation methodologies.
comment: to be published in Interspeech 2024
☆ InkubaLM: A small language model for low-resource African languages
Atnafu Lambebo Tonja, Bonaventure F. P. Dossou, Jessica Ojo, Jenalea Rajab, Fadel Thior, Eric Peter Wairagala, Aremu Anuoluwapo, Pelonomi Moiloa, Jade Abbott, Vukosi Marivate, Benjamin Rosman
High-resource language models often fall short in the African context, where
there is a critical need for models that are efficient, accessible, and locally
relevant, even amidst significant computing and data constraints. This paper
introduces InkubaLM, a small language model with 0.4 billion parameters, which
achieves performance comparable to models with significantly larger parameter
counts and more extensive training data on tasks such as machine translation,
question-answering, AfriMMLU, and the AfriXnli task. Notably, InkubaLM
outperforms many larger models in sentiment analysis and demonstrates
remarkable consistency across multiple languages. This work represents a
pivotal advancement in challenging the conventional paradigm that effective
language models must rely on substantial resources. Our model and datasets are
publicly available (https://huggingface.co/lelapa) to encourage
research and development on low-resource languages.
☆ Dynamic Self-Consistency: Leveraging Reasoning Paths for Efficient LLM Sampling
Self-Consistency (SC) is a widely used method to mitigate hallucinations in
Large Language Models (LLMs) by sampling the LLM multiple times and outputting
the most frequent solution. Despite its benefits, SC results in significant
computational costs proportional to the number of samples generated. Previous
early-stopping approaches, such as Early Stopping Self Consistency and Adaptive
Consistency, have aimed to reduce these costs by considering output
consistency, but they do not analyze the quality of the reasoning paths (RPs)
themselves. To address this issue, we propose Reasoning-Aware Self-Consistency
(RASC), an innovative early-stopping framework that dynamically adjusts the
number of sample generations by considering both the output answer and the RPs
from Chain of Thought (CoT) prompting. RASC assigns confidence scores
sequentially to the generated samples, stops when certain criteria are met, and
then employs weighted majority voting to optimize sample usage and enhance
answer reliability. We comprehensively test RASC with multiple LLMs across
varied QA datasets. RASC outperforms existing methods, significantly reducing
sample usage by an average of 80% while maintaining or improving accuracy by up
to 5% compared to the original SC.
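The early-stopping loop described above can be re-implemented schematically: draw samples one at a time, weight each answer by a confidence score on its reasoning path, stop once one answer's accumulated weight crosses a threshold, and return the weighted majority. The sampler, scorer, and threshold below are stand-ins, not the paper's learned components.

```python
# Hedged sketch of reasoning-aware self-consistency with early stopping.
from collections import defaultdict

def rasc(sampler, scorer, max_samples=10, threshold=1.5):
    weights = defaultdict(float)
    for i in range(max_samples):
        answer, reasoning = sampler(i)
        weights[answer] += scorer(reasoning)   # confidence-weighted vote
        best, w = max(weights.items(), key=lambda kv: kv[1])
        if w >= threshold:                     # early stop
            return best, i + 1
    return max(weights.items(), key=lambda kv: kv[1])[0], max_samples

# Toy sampler: a fixed sequence of (answer, reasoning-quality score).
runs = [("42", 0.9), ("42", 0.8), ("17", 0.3), ("42", 0.7)]
answer, used = rasc(lambda i: runs[i], lambda q: q)
print(answer, used)  # 42 2  (stops after two samples: 0.9 + 0.8 >= 1.5)
```

Plain self-consistency would have spent all the samples; scoring the reasoning paths is what lets the loop stop early without hurting the final vote.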
☆ Tool-Assisted Agent on SQL Inspection and Refinement in Real-World Scenarios
Recent Text-to-SQL methods leverage large language models (LLMs) by
incorporating feedback from the database management system. While these methods
effectively address execution errors in SQL queries, they struggle with
database mismatches -- errors that do not trigger execution exceptions.
Database mismatches include issues such as condition mismatches and stricter
constraint mismatches, both of which are more prevalent in real-world
scenarios. To address these challenges, we propose a tool-assisted agent
framework for SQL inspection and refinement, equipping the LLM-based agent with
two specialized tools: a retriever and a detector, designed to diagnose and
correct SQL queries with database mismatches. These tools enhance the
capability of LLMs to handle real-world queries more effectively. We also
introduce Spider-Mismatch, a new dataset specifically constructed to reflect
the condition mismatch problems encountered in real-world scenarios.
Experimental results demonstrate that our method achieves the highest
performance on the averaged results of the Spider and Spider-Realistic datasets
in few-shot settings, and it significantly outperforms baseline methods on the
more realistic dataset, Spider-Mismatch.
comment: work in progress
☆ MemLong: Memory-Augmented Retrieval for Long Text Modeling
Recent advancements in Large Language Models (LLMs) have yielded remarkable
success across diverse fields. However, handling long contexts remains a
significant challenge for LLMs due to the quadratic time and space complexity
of attention mechanisms and the growing memory consumption of the key-value
cache during generation. This work introduces MemLong: Memory-Augmented
Retrieval for Long Text Generation, a method designed to enhance the
capabilities of long-context language modeling by utilizing an external
retriever for historical information retrieval. MemLong combines a
non-differentiable "ret-mem" module with a partially trainable decoder-only
language model and introduces a fine-grained, controllable retrieval attention
mechanism that leverages semantic-level relevant chunks. Comprehensive
evaluations on multiple long-context language modeling benchmarks demonstrate
that MemLong consistently outperforms other state-of-the-art LLMs. More
importantly, MemLong can extend the context length on a single 3090 GPU from 4k
up to 80k. Our code is available at https://github.com/Bui1dMySea/MemLong
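The external retrieval step MemLong relies on reduces to fetching the top-k stored chunks most similar to the current context. The bare-bones version below uses cosine similarity over toy two-dimensional embeddings; the memory contents and vectors are invented for illustration.

```python
# Minimal semantic retrieval over an external chunk memory.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(memory, query_vec, k=2):
    """memory: list of (chunk_text, embedding) pairs; return top-k texts."""
    ranked = sorted(memory, key=lambda m: cosine(m[1], query_vec),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

memory = [
    ("chapter on whales", [1.0, 0.0]),
    ("chapter on ships", [0.0, 1.0]),
    ("chapter on the sea", [0.7, 0.7]),
]
print(retrieve(memory, [0.9, 0.1], k=2))
```

In MemLong the retrieved chunks then feed the controllable retrieval attention rather than being concatenated into the prompt, which is what keeps the memory cost bounded.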
☆ UserSumBench: A Benchmark Framework for Evaluating User Summarization Approaches
Large language models (LLMs) have shown remarkable capabilities in generating
user summaries from a long list of raw user activity data. These summaries
capture essential user information such as preferences and interests, and
therefore are invaluable for LLM-based personalization applications, such as
explainable recommender systems. However, the development of new summarization
techniques is hindered by the lack of ground-truth labels, the inherent
subjectivity of user summaries, and human evaluation which is often costly and
time-consuming. To address these challenges, we introduce UserSumBench, a
benchmark framework designed to facilitate iterative development of LLM-based
summarization approaches. This framework offers two key components: (1) A
reference-free summary quality metric. We show that this metric is effective
and aligned with human preferences across three diverse datasets (MovieLens,
Yelp and Amazon Review). (2) A novel robust summarization method that leverages
a time-hierarchical summarizer and a self-critique verifier to produce
high-quality summaries while eliminating hallucination. This method serves as a
strong
baseline for further innovation in summarization techniques.
♻ ☆ Evaluating Named Entity Recognition: A comparative analysis of mono- and multilingual transformer models on a novel Brazilian corporate earnings call transcripts dataset
Since 2018, when the Transformer architecture was introduced, Natural
Language Processing has gained significant momentum with pre-trained
Transformer-based models that can be fine-tuned for various tasks. Most models
are pre-trained on large English corpora, making them less applicable to other
languages, such as Brazilian Portuguese. In our research, we identified two
models pre-trained in Brazilian Portuguese (BERTimbau and PTT5) and two
multilingual models (mBERT and mT5). BERTimbau and mBERT use only the Encoder
module, while PTT5 and mT5 use both the Encoder and Decoder. Our study aimed to
evaluate their performance on a financial Named Entity Recognition (NER) task
and determine the computational requirements for fine-tuning and inference. To
this end, we developed the Brazilian Financial NER (BraFiNER) dataset,
comprising sentences from Brazilian banks' earnings calls transcripts annotated
using a weakly supervised approach. Additionally, we introduced a novel
approach that reframes the token classification task as a text generation
problem. After fine-tuning the models, we evaluated them using performance and
error metrics. Our findings reveal that BERT-based models consistently
outperform T5-based models. While the multilingual models exhibit comparable
macro F1-scores, BERTimbau demonstrates superior performance over PTT5. In
terms of error metrics, BERTimbau outperforms the other models. We also
observed that PTT5 and mT5 generated sentences with changes in monetary and
percentage values, highlighting the importance of accuracy and consistency in
the financial domain. Our findings provide insights into the differing
performance of BERT- and T5-based models for the NER task.
♻ ☆ Exploring Group and Symmetry Principles in Large Language Models
Large Language Models (LLMs) have demonstrated impressive performance across
a wide range of applications; however, assessing their reasoning capabilities
remains a significant challenge. In this paper, we introduce a framework
grounded in group and symmetry principles, which have played a crucial role in
fields such as physics and mathematics, and offer another way to evaluate their
capabilities. While the proposed framework is general, to showcase the benefits
of employing these properties, we focus on arithmetic reasoning and investigate
the performance of these models on four group properties: closure, identity,
inverse, and associativity. Our findings reveal that LLMs studied in this work
struggle to preserve group properties across different test regimes. In the
closure test, we observe biases towards specific outputs and an abrupt
degradation in their performance from 100% to 0% after a specific sequence
length. They also perform poorly in the identity test, which involves adding
irrelevant information to the context, and show sensitivity when subjected to
the inverse test, which examines the robustness of the model with respect to
negation. In addition, we demonstrate that breaking down problems into smaller
steps helps LLMs in the associativity test that we have conducted. To support
these tests we have developed a synthetic dataset which will be released.
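The four group properties probed above can be checked mechanically in a setting where they all hold, such as addition modulo N; the quick script below illustrates what each test asks for. (The paper tests LLM outputs, not Python arithmetic, so this is only a reference implementation of the properties themselves.)

```python
# Verify the four group axioms for addition mod N (a genuine group).
N = 7
op = lambda a, b: (a + b) % N
elems = range(N)

closure = all(op(a, b) in elems for a in elems for b in elems)
identity = all(op(a, 0) == a for a in elems)                 # 0 is identity
inverse = all(any(op(a, b) == 0 for b in elems) for a in elems)
assoc = all(op(op(a, b), c) == op(a, op(b, c))
            for a in elems for b in elems for c in elems)
print(closure, identity, inverse, assoc)  # True True True True
```

The paper's identity test corresponds to appending irrelevant context (adding zero), and the inverse test to negation; a model with robust arithmetic reasoning should pass all four just as the modular group does.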
♻ ☆ Hoaxpedia: A Unified Wikipedia Hoax Articles Dataset
Hoaxes are a recognised form of disinformation created deliberately, with
potential serious implications in the credibility of reference knowledge
resources such as Wikipedia. What makes detecting Wikipedia hoaxes hard is that
they often are written according to the official style guidelines. In this
work, we first provide a systematic analysis of similarities and discrepancies
between legitimate and hoax Wikipedia articles, and introduce Hoaxpedia, a
collection of 311 hoax articles (from existing literature and official
Wikipedia lists), together with semantically similar legitimate articles, which
together form a binary text classification dataset aimed at fostering research
in automated hoax detection. In this paper, we report results after analyzing
several language models, hoax-to-legit ratios, and the amount of text
classifiers are exposed to (the full article vs. the article's definition
alone). Our results suggest that detecting deceitful content in Wikipedia based
on content alone is hard but feasible. We complement this analysis with a study
of differences in edit-history distributions, and find that this feature yields
better classification results than article content.
♻ ☆ DualKanbaFormer: Kolmogorov-Arnold Networks and State Space Model Transformer for Multimodal Aspect-based Sentiment Analysis
Multimodal aspect-based sentiment analysis (MABSA) enhances sentiment
detection by combining text with other data types like images. However, despite
setting significant benchmarks, attention mechanisms exhibit limitations in
efficiently modelling long-range dependencies between aspect and opinion
targets within the text. They also face challenges in capturing global-context
dependencies for visual representations. To this end, we propose
Kolmogorov-Arnold Networks (KANs) and Selective State Space model (Mamba)
transformer (DualKanbaFormer), a novel architecture to address the above
issues. We leverage the power of Mamba to capture global context dependencies,
Multi-head Attention (MHA) to capture local context dependencies, and KANs to
capture non-linear modelling patterns for both textual representations (textual
KanbaFormer) and visual representations (visual KanbaFormer). Furthermore, we
fuse the textual KanbaFormer and visual KanbaFormer with a gated fusion layer to
capture the inter-modality dynamics. According to extensive experimental
results, our model outperforms some state-of-the-art (SOTA) studies on two
public datasets.
comment: 10 pages, 2 figures, and 3 tables
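The gated fusion of the textual and visual branches can be illustrated with a toy scalar-gate sketch; the sigmoid gate and fixed weight here are hypothetical simplifications, not the DualKanbaFormer's actual parameterisation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(text_vec, vis_vec, w=0.0):
    # A learned scalar gate (here a fixed hypothetical weight w) decides
    # how much of each modality's representation survives in the fusion.
    g = sigmoid(w)
    return [g * t + (1 - g) * v for t, v in zip(text_vec, vis_vec)]

# With w=0 the gate is 0.5, so the two modalities are mixed equally.
fused = gated_fusion([1.0, 0.0], [0.0, 1.0])
print(fused)
```

In practice the gate would itself be a learned function of both inputs, letting the model weigh modalities per example.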
♻ ☆ Question-Based Retrieval using Atomic Units for Enterprise RAG
Enterprise retrieval augmented generation (RAG) offers a highly flexible
framework for combining powerful large language models (LLMs) with internal,
possibly temporally changing, documents. In RAG, documents are first chunked.
Relevant chunks are then retrieved for a user query, which are passed as
context to a synthesizer LLM to generate the query response. However, the
retrieval step can limit performance, as incorrect chunks can lead the
synthesizer LLM to generate a false response. This work applies a zero-shot
adaptation of standard dense retrieval steps for more accurate chunk recall.
Specifically, a chunk is first decomposed into atomic statements. A set of
synthetic questions are then generated on these atoms (with the chunk as the
context). Dense retrieval involves finding the closest set of synthetic
questions, and associated chunks, to the user query. It is found that retrieval
with the atoms leads to higher recall than retrieval with chunks. Further
performance gain is observed with retrieval using the synthetic questions
generated over the atoms. Higher recall at the retrieval step enables higher
performance of the enterprise LLM using the RAG pipeline.
comment: 14 pages, 5 figures, 5 tables
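The retrieval step described above can be sketched as follows; this is not the paper's code, and the bag-of-words similarity, example questions, and chunk ids are illustrative stand-ins for a dense encoder and a real index of LLM-generated questions over atomic statements:

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding"; a real system would use a dense encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Each chunk is indexed not by its own text but by synthetic questions
# generated over its atomic statements (hypothetical example data).
index = [
    ("What year was the plant founded?", "chunk-1"),
    ("Who founded the plant?",           "chunk-1"),
    ("What does the plant produce?",     "chunk-2"),
]

def retrieve(query, index):
    # Return the chunk whose closest synthetic question best matches the query.
    scored = [(cosine(embed(query), embed(q)), chunk) for q, chunk in index]
    return max(scored)[1]

print(retrieve("When was the plant founded?", index))
```

The intuition is that a user query resembles a question far more than it resembles raw chunk text, so question-to-query matching improves recall.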
♻ ☆ Beyond One-Size-Fits-All: Multi-Domain, Multi-Task Framework for Embedding Model Selection
This position paper proposes a systematic approach towards developing a
framework to help select the most effective embedding models for natural
language processing (NLP) tasks, addressing the challenge posed by the
proliferation of both proprietary and open-source encoder models.
comment: It was an initial idea - we plan to work on a detailed version
♻ ☆ Docling Technical Report
Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Nikolaos Livathinos, Panos Vagenas, Cesar Berrospi Ramis, Matteo Omenetti, Fabian Lindlbauer, Kasper Dinkla, Lokesh Mishra, Yusik Kim, Shubham Gupta, Rafael Teixeira de Lima, Valery Weber, Lucas Morin, Ingmar Meijer, Viktor Kuropiatnyk, Peter W. J. Staar
This technical report introduces Docling, an easy-to-use, self-contained,
MIT-licensed open-source package for PDF document conversion. It is powered by
state-of-the-art specialized AI models for layout analysis (DocLayNet) and
table structure recognition (TableFormer), and runs efficiently on commodity
hardware in a small resource budget. The code interface allows for easy
extensibility and addition of new features and models.
♻ ☆ An Empirical Study of Retrieval Augmented Generation with Chain-of-Thought SC
Since the launch of ChatGPT at the end of 2022, generative dialogue models
represented by ChatGPT have quickly become essential tools in daily life. As
user expectations increase, enhancing the capability of generative dialogue
models to solve complex problems has become a focal point of current research.
This paper delves into the effectiveness of the RAFT (Retrieval Augmented
Fine-Tuning) method in improving the performance of generative dialogue models.
RAFT combines chain-of-thought with supervised fine-tuning (SFT) and
retrieval augmented generation (RAG), significantly enhancing the model's
information extraction and logical reasoning abilities. We evaluated the RAFT
method across multiple datasets and analysed its performance in various
reasoning tasks, including long-form QA and short-form QA tasks, tasks in both
Chinese and English, and supportive and comparison reasoning tasks. Notably, it
addresses the gaps in previous research regarding long-form QA tasks and
Chinese datasets. Moreover, we also evaluate the benefit of the
chain-of-thought (CoT) in the RAFT method. This work offers valuable insights
for studies focused on enhancing the performance of generative dialogue models.
comment: Accepted by ISCSLP 2024
♻ ☆ Language models align with human judgments on key grammatical constructions
Do large language models (LLMs) make human-like linguistic generalizations?
Dentella et al. (2023) ("DGL") prompt several LLMs ("Is the following sentence
grammatically correct in English?") to elicit grammaticality judgments of 80
English sentences, concluding that LLMs demonstrate a "yes-response bias" and a
"failure to distinguish grammatical from ungrammatical sentences". We
re-evaluate LLM performance using well-established practices and find that
DGL's data in fact provide evidence for just how well LLMs capture human
behaviors. Models not only achieve high accuracy overall, but also capture
fine-grained variation in human linguistic judgments.
comment: Published in PNAS at https://www.pnas.org/doi/10.1073/pnas.2400917121
as response to Dentella et al. (2023)
♻ ☆ Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer ECAI 2024
The Mixture of Experts (MoE) has emerged as a highly successful technique in
deep learning, based on the principle of divide-and-conquer to maximize model
capacity without significant additional computational cost. Even in the era of
large-scale language models (LLMs), MoE continues to play a crucial role, as
some researchers have indicated that GPT-4 adopts the MoE structure to ensure
diverse inference results. However, MoE is susceptible to performance
degeneracy, particularly evident in the issues of imbalance and homogeneous
representation among experts. While previous studies have extensively addressed
the problem of imbalance, the challenge of homogeneous representation remains
unresolved. In this study, we shed light on the homogeneous representation
problem, wherein experts in the MoE fail to specialize and lack diversity,
leading to frustratingly high similarities in their representations (up to 99%
in a well-performed MoE model). This problem restricts the expressive power of
the MoE and, we argue, contradicts its original intention. To tackle this
issue, we propose a straightforward yet highly effective solution: OMoE, an
orthogonal expert optimizer. Additionally, we introduce an alternating training
strategy that encourages each expert to update in a direction orthogonal to the
subspace spanned by other experts. Our algorithm facilitates MoE training in
two key ways: firstly, it explicitly enhances representation diversity, and
secondly, it implicitly fosters interaction between experts during orthogonal
weights computation. Through extensive experiments, we demonstrate that our
proposed optimization algorithm significantly improves the performance of
fine-tuning the MoE model on the GLUE benchmark, SuperGLUE benchmark,
question-answering tasks, and named entity recognition tasks.
comment: ECAI 2024
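The orthogonal-update idea can be sketched minimally (this is not the OMoE implementation): assuming the other experts' weight vectors have already been orthogonalised, an expert's gradient is projected onto the orthogonal complement of their span before the step is taken:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(g, basis):
    # Remove from gradient g its component along each basis vector.
    # Assumes the basis is already orthogonal (e.g. via Gram-Schmidt),
    # so sequential subtraction yields an update orthogonal to the span.
    out = list(g)
    for b in basis:
        nb = dot(b, b)
        if nb == 0:
            continue
        coef = dot(out, b) / nb
        out = [o - coef * bi for o, bi in zip(out, b)]
    return out

# Hypothetical toy example: one expert's gradient, and the weight vectors
# of two other experts whose subspace the update must avoid.
g = [1.0, 1.0, 1.0]
others = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
update = project_out(g, others)
print(update)  # only the component outside the others' span survives
```

Updating each expert orthogonally to its peers is what pushes their representations apart and counters the homogeneity the abstract describes.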
♻ ☆ EUvsDisinfo: A Dataset for Multilingual Detection of Pro-Kremlin Disinformation in News Articles CIKM 2024
This work introduces EUvsDisinfo, a multilingual dataset of disinformation
articles originating from pro-Kremlin outlets, along with trustworthy articles
from credible / less biased sources. It is sourced directly from the debunk
articles written by experts leading the EUvsDisinfo project. Our dataset is the
largest to-date resource in terms of the overall number of articles and
distinct languages. It also provides the largest topical and temporal coverage.
Using this dataset, we investigate the dissemination of pro-Kremlin
disinformation across different languages, uncovering language-specific
patterns targeting certain disinformation topics. We further analyse the
evolution of topic distribution over an eight-year period, noting a significant
surge in disinformation content before the full-scale invasion of Ukraine in
2022. Lastly, we demonstrate the dataset's applicability in training models to
effectively distinguish between disinformation and trustworthy content in
multilingual settings.
comment: Published at CIKM 2024
♻ ☆ Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Large Language Models (LLMs) have performed exceptionally in various
text-generative tasks, including question answering, translation, code
completion, etc. However, the over-assistance of LLMs has raised the challenge
of "jailbreaking", which induces the model to generate malicious responses
against the usage policy and society by designing adversarial prompts. With the
emergence of jailbreak attack methods exploiting different vulnerabilities in
LLMs, the corresponding safety alignment measures are also evolving. In this
paper, we propose a comprehensive and detailed taxonomy of jailbreak attack and
defense methods. For instance, the attack methods are divided into black-box
and white-box attacks based on the transparency of the target model. Meanwhile,
we classify defense methods into prompt-level and model-level defenses.
Additionally, we further subdivide these attack and defense methods into
distinct sub-classes and present a coherent diagram illustrating their
relationships. We also conduct an investigation into the current evaluation
methods and compare them from different perspectives. Our findings aim to
inspire future research and practical implementations in safeguarding LLMs
against adversarial attacks. Above all, although jailbreaking remains a
significant concern within the community, we believe that our work enhances the
understanding of this domain and provides a foundation for developing more
secure LLMs.
♻ ☆ Expert-Token Resonance: Redefining MoE Routing through Affinity-Driven Active Selection
Mixture-of-Experts (MoE) architectures have emerged as a paradigm-shifting
approach for large language models (LLMs), offering unprecedented computational
efficiency. However, these architectures grapple with challenges of token
distribution imbalance and expert homogenization, impeding optimal semantic
generalization. We introduce a novel framework that redefines MoE routing
through affinity-driven active selection. The framework's innovations
encompass: (1) A rigorous formulation of expert-token affinity metrics. (2) An
adaptive bidirectional selection mechanism leveraging resonance between experts
and tokens. (3) Theoretical derivation and experimental evidence of reduced
expert capacity bounds under dynamic token distribution evolution. The framework
also integrates an orthogonal feature extraction module and an optimized loss
function for expert localization. Our theoretical analysis demonstrates that
this approach mitigates expert homogenization while enabling substantial
capacity boundary reduction. Experimental validation corroborates these
findings: it achieves a 40% reduction in tokens processed by each expert without
compromising model convergence or efficacy. When coupled with communication
optimizations, the training efficiency improvements of 5.4% to 46.6% can be
observed. After supervised fine-tuning, it exhibits performance gains of 9.7%
to 14.1% across GDAD, C-Eval, and TeleQnA benchmarks.
♻ ☆ TaSL: Task Skill Localization and Consolidation for Language Model Continual Learning ACL 2024
Language model continual learning (CL) has recently attracted significant
interest for its ability to adapt large language models (LLMs) to dynamic
real-world scenarios without retraining. A major challenge in this domain is
catastrophic forgetting, where models lose previously acquired knowledge upon
learning new tasks. Existing approaches commonly utilize multiple
parameter-efficient fine-tuning (PEFT) blocks to acquire task-specific
knowledge, yet these methods are inefficient and fail to leverage potential
knowledge transfer across tasks. In this paper, we introduce a novel CL
framework for language models, named Task Skill Localization and Consolidation
(TaSL), which boosts knowledge transfer without depending on memory replay.
TaSL initially segregates the model into 'skill units' based on parameter
dependencies, allowing for more precise control. Subsequently, it employs a
novel group-wise skill localization technique to ascertain the importance
distribution of skill units for a new task. By comparing this importance
distribution with those from previous tasks, we implement a fine-grained skill
consolidation strategy that retains task-specific knowledge, thereby preventing
forgetting, and updates task-shared knowledge, which facilitates bi-directional
knowledge transfer. As a result, TaSL achieves an optimal balance between
retaining prior knowledge and excelling in new tasks. TaSL also demonstrates
strong generalizability, making it suitable for various base models and
adaptable to PEFT methods like LoRA. Furthermore, it offers notable
extensibility, supporting enhancements through integration with memory replay
techniques. Comprehensive experiments conducted on two CL benchmarks, involving
models ranging from 220M to 7B parameters, affirm the effectiveness of TaSL and
its variants across different settings.
comment: Extension of ACL 2024 paper titled: Continual Dialog State Tracking
via Task Skill Localization and Consolidation
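The consolidation idea behind TaSL can be illustrated with a toy sketch; the threshold rule and example values are illustrative assumptions, not the paper's actual group-wise strategy:

```python
def consolidate(old_imp, new_imp, old_w, new_w, tau=0.5):
    # Toy skill-consolidation rule: a skill unit important only to
    # previous tasks keeps its old weights (preventing forgetting),
    # while every other unit adopts the newly learned weights
    # (enabling transfer of task-shared knowledge).
    out = []
    for oi, ni, ow, nw in zip(old_imp, new_imp, old_w, new_w):
        out.append(ow if oi >= tau and ni < tau else nw)
    return out

# Hypothetical importances and weights for three skill units.
merged = consolidate([0.9, 0.1, 0.8], [0.2, 0.9, 0.7],
                     [1, 2, 3], [10, 20, 30])
print(merged)  # unit 1 is preserved; units 2 and 3 are updated
```

The key point is that the decision is made per skill unit by comparing importance distributions, rather than freezing or replacing whole PEFT blocks.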
♻ ☆ ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages
Recent work shows Large Language Models (LLMs) struggle to understand natural
language constraints for various text generation tasks in zero- and few-shot
settings. In the code domain, however, constraints are widely expressed in code
format to maintain the integrity of code written in Domain-Specific Languages
(DSLs) such as JSON and YAML, which are widely used for system-level programming
tasks in enterprises. Given that LLMs are increasingly used for system-level
code tasks, evaluating if they can comprehend these code constraints is
crucial. However, no work has been done to evaluate their controllability over
code constraints. Hence, we introduce ConCodeEval, a first-of-its-kind
benchmark having two novel tasks for code constraints across five
representations. Our findings suggest that language models struggle with code
constraints. Code languages that perform excellently for normal code tasks do
not perform well when the same languages represent fine-grained constraints.
♻ ☆ Contextualized Automatic Speech Recognition with Dynamic Vocabulary
Deep biasing (DB) enhances the performance of end-to-end automatic speech
recognition (E2E-ASR) models for rare words or contextual phrases using a bias
list. However, most existing methods treat bias phrases as sequences of
subwords in a predefined static vocabulary. This naive sequence decomposition
produces unnatural token patterns, significantly lowering their occurrence
probability. More advanced techniques address this problem by expanding the
vocabulary with additional modules, including the external language model
shallow fusion or rescoring. However, they increase the workload due to the
additional modules. This paper proposes a dynamic vocabulary where
bias tokens can be added during inference. Each entry in a bias list is
represented as a single token, rather than as a sequence of existing subword tokens.
This approach eliminates the need to learn subword dependencies within the bias
phrases. This method is easily applied to various architectures because it only
expands the embedding and output layers in common E2E-ASR architectures.
Experimental results demonstrate that the proposed method improves the bias
phrase WER on English and Japanese datasets by 3.1 -- 4.9 points compared with
the conventional DB method.
♻ ☆ Causal-Guided Active Learning for Debiasing Large Language Models ACL 2024
Although achieving promising performance, recent analyses show that current
generative large language models (LLMs) may still capture dataset biases and
utilize them for generation, leading to poor generalizability and harmfulness
of LLMs. However, due to the diversity of dataset biases and the
over-optimization problem, previous prior-knowledge-based debiasing methods and
fine-tuning-based debiasing methods may not be suitable for current LLMs. To
address this issue, we explore combining active learning with the causal
mechanisms and propose a causal-guided active learning (CAL) framework, which
utilizes the LLM itself to automatically and autonomously identify informative
biased samples and induce the bias patterns. Then a cost-effective and
efficient in-context learning based method is employed to prevent LLMs from
utilizing dataset biases during generation. Experimental results show that CAL
can effectively recognize typical biased instances and induce various bias
patterns for debiasing LLMs.
comment: Accepted to ACL 2024 main conference & awarded Outstanding Paper
♻ ☆ Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent
In this paper, we present Cross Language Agent -- Simultaneous
Interpretation, CLASI, a high-quality and human-like Simultaneous Speech
Translation (SiST) System. Inspired by professional human interpreters, we
utilize a novel data-driven read-write strategy to balance the translation
quality and latency. To address the challenge of translating in-domain
terminologies, CLASI employs a multi-modal retrieving module to obtain relevant
information to augment the translation. Supported by LLMs, our approach can
generate error-tolerated translation by considering the input audio, historical
context, and retrieved information. Experimental results show that our system
outperforms other systems by significant margins. Aligned with professional
human interpreters, we evaluate CLASI with a better human evaluation metric,
valid information proportion (VIP), which measures the amount of information
that can be successfully conveyed to the listeners. In the real-world
scenarios, where the speeches are often disfluent, informal, and unclear, CLASI
achieves VIP of 81.3% and 78.0% for Chinese-to-English and English-to-Chinese
translation directions, respectively. In contrast, state-of-the-art commercial
or open-source systems only achieve 35.4% and 41.6%. On the extremely hard
dataset, where other systems achieve under 13% VIP, CLASI can still achieve 70%
VIP.
comment: Authors are listed in alphabetical order by last name. Demonstrations
and human-annotated test sets are available at
https://byteresearchcla.github.io/clasi
♻ ☆ SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding
Sihang Li, Jin Huang, Jiaxi Zhuang, Yaorui Shi, Xiaochen Cai, Mingjun Xu, Xiang Wang, Linfeng Zhang, Guolin Ke, Hengxing Cai
Scientific literature understanding is crucial for extracting targeted
information and garnering insights, thereby significantly advancing scientific
discovery. Despite the remarkable success of Large Language Models (LLMs), they
face challenges in scientific literature understanding, primarily due to (1) a
lack of scientific knowledge and (2) unfamiliarity with specialized scientific
tasks.
To develop an LLM specialized in scientific literature understanding, we
propose a hybrid strategy that integrates continual pre-training (CPT) and
supervised fine-tuning (SFT), to simultaneously infuse scientific domain
knowledge and enhance instruction-following capabilities for domain-specific
tasks. In this process, we identify two key challenges: (1) constructing
high-quality CPT corpora, and (2) generating diverse SFT instructions. We
address these challenges through a meticulous pipeline, including PDF text
extraction, parsing content error correction, quality filtering, and synthetic
instruction creation. Applying this strategy, we present a suite of LLMs:
SciLitLLM, specialized in scientific literature understanding. These models
demonstrate promising performance on scientific literature understanding
benchmarks.
Our contributions are threefold: (1) We present an effective framework that
integrates CPT and SFT to adapt LLMs to scientific literature understanding,
which can also be easily adapted to other domains. (2) We propose an LLM-based
synthesis method to generate diverse and high-quality scientific instructions,
resulting in a new instruction set -- SciLitIns -- for supervised fine-tuning
in less-represented scientific domains. (3) SciLitLLM achieves promising
performance improvements on scientific literature understanding benchmarks.
♻ ☆ AgentsCourt: Building Judicial Decision-Making Agents with Court Debate Simulation and Legal Knowledge Augmentation ACL
Zhitao He, Pengfei Cao, Chenhao Wang, Zhuoran Jin, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, Jun Zhao
With the development of deep learning, natural language processing technology
has effectively improved the efficiency of various aspects of the traditional
judicial industry. However, most current efforts focus on tasks within
individual judicial stages, making it difficult to handle complex tasks that
span multiple stages. Meanwhile, autonomous agents powered by large language
models are becoming increasingly capable of making complex decisions in
real-world settings, offering new insights for judicial intelligence. In this
paper, (1) we propose a novel multi-agent framework, AgentsCourt, for judicial
decision-making. Our framework follows the classic court trial process,
consisting of court debate simulation, legal resources retrieval and
decision-making refinement to simulate the decision-making of a judge. (2) We
introduce SimuCourt, a judicial benchmark that encompasses 420 Chinese judgment
documents, spanning the three most common types of judicial cases. Furthermore,
to support this task, we construct a large-scale legal knowledge base,
Legal-KB, with multi-resource legal knowledge. (3) Extensive experiments show
that our framework outperforms the existing advanced methods in various
aspects, especially in generating legal articles, where our model achieves
significant improvements of 8.6% and 9.1% F1 score in the first and second
instance settings, respectively.
comment: This paper was first submitted to ACL ARR 2024 April (Under review)
♻ ☆ Does CLIP Bind Concepts? Probing Compositionality in Large Image Models
Large-scale neural network models combining text and images have made
incredible progress in recent years. However, it remains an open question to
what extent such models encode compositional representations of the concepts
over which they operate, such as correctly identifying "red cube" by reasoning
over the constituents "red" and "cube". In this work, we focus on the ability
of a large pretrained vision and language model (CLIP) to encode compositional
concepts and to bind variables in a structure-sensitive way (e.g.,
differentiating "cube behind sphere" from "sphere behind cube"). To inspect the
performance of CLIP, we compare several architectures from research on
compositional distributional semantics models (CDSMs), a line of research that
attempts to implement traditional compositional linguistic structures within
embedding spaces. We benchmark them on three synthetic datasets -
single-object, two-object, and relational - designed to test concept binding.
We find that CLIP can compose concepts in a single-object setting, but in
situations where concept binding is needed, performance drops dramatically. At
the same time, CDSMs also perform poorly, with best performance at chance
level.
comment: Lewis and Nayak contributed equally
♻ ☆ Measuring Dimensions of Self-Presentation in Twitter Bios and their Links to Misinformation Sharing
Social media platforms provide users with a profile description field,
commonly known as a "bio," where they can present themselves to the world. A
growing literature shows that text in these bios can improve our understanding
of online self-presentation and behavior, but existing work relies exclusively
on keyword-based approaches to do so. We here propose and evaluate a suite of
simple, effective, and theoretically motivated approaches to embed bios in
spaces that capture salient dimensions of social meaning, such as age and
partisanship. We evaluate our methods on four tasks, showing that the
strongest one outperforms several practical baselines. We then show the
utility of our method in helping understand associations between
self-presentation and the sharing of URLs from low-quality news sites on
Twitter, with a particular focus on exploring the interactions between age
and partisanship, and on the effects of self-presentations of
religiosity. Our work provides new tools to help computational social
scientists make use of information in bios, and provides new insights into how
misinformation sharing may be perceived on Twitter.
♻ ☆ Token-level Direct Preference Optimization
Fine-tuning pre-trained Large Language Models (LLMs) is essential to align
them with human values and intentions. This process often utilizes methods like
pairwise comparisons and KL divergence against a reference LLM, focusing on the
evaluation of full answers generated by the models. However, the generation of
these responses occurs at the token level, following a sequential,
auto-regressive fashion. In this paper, we introduce Token-level Direct
Preference Optimization (TDPO), a novel approach to align LLMs with human
preferences by optimizing policy at the token level. Unlike previous methods,
which face challenges in divergence efficiency, TDPO incorporates forward KL
divergence constraints for each token, improving alignment and diversity.
Utilizing the Bradley-Terry model for a token-based reward system, TDPO
enhances the regulation of KL divergence, while preserving simplicity without
the need for explicit reward modeling. Experimental results across various text
tasks demonstrate TDPO's superior performance in balancing alignment with
generation diversity. Notably, fine-tuning with TDPO strikes a better balance
than DPO in the controlled sentiment generation and single-turn dialogue
datasets, and significantly improves the quality of generated responses
compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at
https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.
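The per-token constraint can be illustrated with a toy sketch; this is not the TDPO objective itself (which also involves Bradley-Terry token rewards), but it shows the core idea of accumulating a forward KL term at every token position rather than constraining only the full sequence:

```python
import math

def forward_kl(p_ref, p_policy):
    # Forward KL D(ref || policy) at one token position, summed over
    # the vocabulary; terms with zero reference mass contribute nothing.
    return sum(r * math.log(r / q) for r, q in zip(p_ref, p_policy) if r > 0)

def sequence_kl_penalty(ref_dists, policy_dists):
    # Token-level constraint (sketch): penalise divergence from the
    # reference model at every position along the generated sequence.
    return sum(forward_kl(r, p) for r, p in zip(ref_dists, policy_dists))

# Hypothetical two-token example over a three-word vocabulary.
ref    = [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2]]
policy = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]]
penalty = sequence_kl_penalty(ref, policy)
print(round(penalty, 4))  # zero at the first position, positive at the second
```

A sequence-level KL would only see the aggregate drift; the per-position sum localises exactly which tokens diverge from the reference, which is the granularity the abstract argues for.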
♻ ☆ Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Recent advances in language models have achieved significant progress.
GPT-4o, as a new milestone, has enabled real-time conversations with humans,
demonstrating near-human natural fluency. Such human-computer interaction
necessitates models with the capability to perform reasoning directly with the
audio modality and generate output in streaming. However, this remains beyond
the reach of current academic models, as they typically depend on extra TTS
systems for speech synthesis, resulting in undesirable latency. This paper
introduces the Mini-Omni, an audio-based end-to-end conversational model,
capable of real-time speech interaction. To achieve this capability, we propose
a text-instructed speech generation method, along with batch-parallel
strategies during inference to further boost the performance. Our method also
helps to retain the original model's language capabilities with minimal
degradation, enabling other works to establish real-time interaction
capabilities. We call this training method "Any Model Can Talk". We also
introduce the VoiceAssistant-400K dataset to fine-tune models optimized for
speech output. To the best of our knowledge, Mini-Omni is the first fully end-to-end,
open-source model for real-time speech interaction, offering valuable potential
for future research.
comment: Technical report, work in progress. Demo and code:
https://github.com/gpt-omni/mini-omni
♻ ☆ Advancing Chinese biomedical text mining with community challenges
Hui Zong, Rongrong Wu, Jiaxue Cha, Weizhe Feng, Erman Wu, Jiakun Li, Aibin Shao, Liang Tao, Zuofeng Li, Buzhou Tang, Bairong Shen
Objective: This study aims to review the recent advances in community
challenges for biomedical text mining in China. Methods: We collected
information of evaluation tasks released in community challenges of biomedical
text mining, including task description, dataset description, data source, task
type and related links. A systematic summary and comparative analysis were
conducted on various biomedical natural language processing tasks, such as
named entity recognition, entity normalization, attribute extraction, relation
extraction, event extraction, text classification, text similarity, knowledge
graph construction, question answering, text generation, and large language
model evaluation. Results: We identified 39 evaluation tasks from 6 community
challenges that spanned from 2017 to 2023. Our analysis revealed the diverse
range of evaluation task types and data sources in biomedical text mining. We
explored the potential clinical applications of these community challenge tasks
from a translational biomedical informatics perspective. We compared them with
their English counterparts and discussed the contributions, limitations, lessons, and
guidelines of these community challenges, while highlighting future directions
in the era of large language models. Conclusion: Community challenge evaluation
competitions have played a crucial role in promoting technology innovation and
fostering interdisciplinary collaboration in the field of biomedical text
mining. These challenges provide valuable platforms for researchers to develop
state-of-the-art solutions.
♻ ☆ Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems
Amey Agrawal, Anmol Agarwal, Nitin Kedia, Jayashree Mohan, Souvik Kundu, Nipun Kwatra, Ramachandran Ramjee, Alexey Tumanov
Serving large language models (LLMs) in production can incur substantial
costs, which has prompted recent advances in inference system optimizations.
Today, these systems are evaluated against conventional latency and throughput
metrics (e.g., TTFT, TBT, Normalised Latency, and TPOT). However, these metrics
fail to fully capture the nuances of LLM inference, leading to an incomplete
assessment of user-facing performance crucial for real-time applications such
as chat and translation. In this paper, we first identify the pitfalls of
current performance metrics in evaluating LLM inference systems. We then
propose Etalon, a comprehensive performance evaluation framework that includes
fluidity-index -- a novel metric designed to reflect the intricacies of the LLM
inference process and its impact on real-time user experience. Finally, we
evaluate various existing open-source platforms and model-as-a-service
offerings using Etalon, discussing their strengths and weaknesses. Etalon is
available at https://github.com/project-etalon/etalon.
♻ ☆ Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment
Learning to ground natural language queries to target objects or regions in
3D point clouds is essential for 3D scene understanding. Nevertheless,
existing 3D visual grounding approaches require a substantial number of
bounding box annotations for text queries, which is time-consuming and
labor-intensive to obtain. In this paper, we propose 3D-VLA, a weakly
supervised approach for 3D visual grounding based on Visual Linguistic
Alignment. Our 3D-VLA exploits the superior ability of current large-scale
vision-language models (VLMs) on aligning the semantics between texts and 2D
images, as well as the naturally existing correspondences between 2D images and
3D point clouds, and thus implicitly constructs correspondences between texts
and 3D point clouds with no need for fine-grained box annotations in the
training procedure. During the inference stage, the learned text-3D
correspondence will help us ground the text queries to the 3D target objects
even without 2D images. To the best of our knowledge, this is the first work to
investigate 3D visual grounding in a weakly supervised manner by involving
large scale vision-language models, and extensive experiments on ReferIt3D and
ScanRefer datasets demonstrate that our 3D-VLA achieves comparable and even
superior results over the fully supervised methods.